In the following exercises, we will use the data you collected in the previous session (all comments for the video “The Census” by Last Week Tonight with John Oliver). You might have to adjust the following code to use the correct file path on your computer.
comments <- readRDS("../data/LWT_Census_parsed.rds")
Next, we go through the preprocessing steps described in the slides. As a first step, we remove newline characters from the comment strings without emojis (the TextEmojiDeleted column).
library(tidyverse)
comments <- comments %>%
  mutate(TextEmojiDeleted = str_replace_all(TextEmojiDeleted,
                                            pattern = "\\\n",
                                            replacement = " "))
Next, we tokenize the comments and create a document-feature matrix from which we remove English stopwords.
library(quanteda)
toks <- comments %>%
  pull(TextEmojiDeleted) %>%
  char_tolower() %>%
  tokens(remove_numbers = TRUE,
         remove_punct = TRUE,
         remove_separators = TRUE,
         remove_symbols = TRUE,
         remove_hyphens = TRUE, # in quanteda >= 2, this argument was renamed split_hyphens
         remove_url = TRUE)
comments_dfm <- dfm(toks,
                    remove = quanteda::stopwords("english"))
# NB: in quanteda >= 3, the remove argument is deprecated;
# use dfm(toks) %>% dfm_remove(stopwords("english")) instead
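To see what is being removed here, stopwords("english") simply returns a character vector of common English function words (the Snowball list):

```r
library(quanteda)

# First few entries of the English stopword list used above
head(stopwords("english"))
#> [1] "i"      "me"     "my"     "myself" "we"     "our"
```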
What are the most frequent words in the comments? We can use the function textstat_frequency() from the quanteda package to answer this question and store the result in an object called term_freq.
# in quanteda >= 3, textstat_frequency() lives in the quanteda.textstats package
term_freq <- textstat_frequency(comments_dfm)
head(term_freq, 20)
## feature frequency rank docfreq group
## 1 census 1763 1 1340 all
## 2 people 991 2 728 all
## 3 just 752 3 654 all
## 4 like 619 4 526 all
## 5 one 520 5 432 all
## 6 trump 514 6 457 all
## 7 can 494 7 432 all
## 8 know 453 8 402 all
## 9 john 438 9 406 all
## 10 get 434 10 386 all
## 11 government 396 11 317 all
## 12 question 394 12 329 all
## 13 citizens 369 13 270 all
## 14 us 368 14 299 all
## 15 many 365 15 315 all
## 16 think 293 16 269 all
## 17 even 292 17 271 all
## 18 country 288 18 240 all
## 19 illegal 281 19 218 all
## 20 oliver 271 20 252 all
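The difference between the frequency column (total number of occurrences) and docfreq (number of documents a term appears in) can be illustrated with a minimal made-up example:

```r
library(quanteda)
library(quanteda.textstats) # provides textstat_frequency() in quanteda >= 3

# Two made-up documents: "census" occurs 3 times overall, but in only 2 documents
toy_dfm <- dfm(tokens(c("census census data", "census form")))
textstat_frequency(toy_dfm)
```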
Which words appear in the largest number of comments? To answer this, we can sort by docfreq from the term_freq object created in the previous task.
term_freq %>%
arrange(-docfreq) %>%
head(10)
## feature frequency rank docfreq group
## 1 census 1763 1 1340 all
## 2 people 991 2 728 all
## 3 just 752 3 654 all
## 4 like 619 4 526 all
## 5 trump 514 6 457 all
## 6 one 520 5 432 all
## 7 can 494 7 432 all
## 8 john 438 9 406 all
## 9 know 453 8 402 all
## 10 get 434 10 386 all
We also want to look at the emojis used in the comments on the video “The Census” by Last Week Tonight with John Oliver. Similar to what we did for the comment text without emojis, we first need to wrangle the data (remove missing values, tokenize the emojis, create a DFM).
emoji_toks <- comments %>%
  mutate(Emoji = na_if(Emoji, "NA")) %>%
  mutate(Emoji = str_trim(Emoji)) %>%
  filter(!is.na(Emoji)) %>%
  pull(Emoji) %>%
  tokens()
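The effect of the first two cleaning steps can be checked on a small made-up vector using base R equivalents of na_if() and str_trim():

```r
# Made-up example: literal "NA" strings become real missing values,
# and surrounding whitespace is trimmed
emoji <- c("emoji_fire ", "NA", " emoji_toilet")
emoji[emoji == "NA"] <- NA   # base-R equivalent of na_if(emoji, "NA")
emoji <- trimws(emoji)       # base-R equivalent of str_trim(emoji)
emoji[!is.na(emoji)]
#> [1] "emoji_fire"   "emoji_toilet"
```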
EmojiDfm <- dfm(emoji_toks)
EmojiFreq <- textstat_frequency(EmojiDfm)
head(EmojiFreq, n = 10)
## feature frequency rank docfreq group
## 1 emoji_facewithtearsofjoy 114 1 67 all
## 2 emoji_rollingonthefloorlaughing 37 2 21 all
## 3 emoji_thinkingface 30 3 19 all
## 4 emoji_registered 14 4 4 all
## 5 emoji_grinningfacewithsweat 13 5 11 all
## 6 emoji_fire 12 6 3 all
## 7 emoji_grinningsquintingface 11 7 7 all
## 8 emoji_unamusedface 9 8 9 all
## 9 emoji_facewithrollingeyes 8 9 8 all
## 10 emoji_toilet 8 9 5 all
EmojiFreq %>%
arrange(-docfreq) %>%
head(10)
## feature frequency rank docfreq group
## 1 emoji_facewithtearsofjoy 114 1 67 all
## 2 emoji_rollingonthefloorlaughing 37 2 21 all
## 3 emoji_thinkingface 30 3 19 all
## 4 emoji_grinningfacewithsweat 13 5 11 all
## 5 emoji_unamusedface 9 8 9 all
## 6 emoji_facewithrollingeyes 8 9 8 all
## 7 emoji_grinningsquintingface 11 7 7 all
## 8 emoji_thumbsup 7 11 7 all
## 9 emoji_manshrugging 6 14 6 all
## 10 emoji_toilet 8 9 5 all
To display the actual emojis in our plot, we source a helper function from the emoji_mapping_function.R file, which creates the mapping objects used below.
source("../scripts/emoji_mapping_function.R")
create_emoji_mappings(EmojiFreq, 10)
head(EmojiFreq, n = 10) %>%
  ggplot(aes(x = reorder(feature, -frequency), y = frequency)) +
  geom_bar(stat = "identity",
           color = "black",
           fill = "#FFCC4D") +
  geom_point() +
  labs(title = "Most frequent emojis in comments",
       subtitle = "The Census: Last Week Tonight with John Oliver (HBO)
       \nhttps://www.youtube.com/watch?v=1aheRpmurAo&t=33s",
       x = "",
       y = "Frequency") +
  scale_y_continuous(expand = c(0,0),
                     limits = c(0,150)) +
  theme(panel.grid.major.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  mapping1 +
  mapping2 +
  mapping3 +
  mapping4 +
  mapping5 +
  mapping6 +
  mapping7 +
  mapping8 +
  mapping9 +
  mapping10
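If you want to keep the figure, ggsave() from ggplot2 saves the most recently displayed plot; the file name and dimensions below are just examples:

```r
library(ggplot2)

# Save the last plot shown; width/height in inches (file name is an example)
ggsave("emoji_frequencies.png", width = 8, height = 5, dpi = 300)
```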